Data Science with
DAY 2 pm
The website Gapminder has a large collection of data sets, mostly in Excel format.
We will retrieve the data about Adults with HIV (estimated prevalence of HIV in percentage, ages 15-49) from Gapminder. The URL is https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx
The observational units are the countries, a fixed variable is the year the estimated prevalence corresponds to and the measured variable is the estimated prevalence.
The function read_excel() cannot download Excel files directly from the web.
We use the function download.file() to download the file into a directory and then we use read_excel() to read it into R.
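The download step can be wrapped in a small helper so that the file is fetched only once and the target directory is created when needed. This is only a sketch: the name fetch_file() is hypothetical and not part of the original notes. The argument mode = "wb" asks download.file() to write the file in binary mode, which matters for .xlsx files on Windows.

```r
# Hypothetical helper (an assumption, not from the original notes):
# download a binary file only if it is not already on disk.
fetch_file <- function(url, destfile) {
  # create the target directory if it does not exist yet
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  if (!file.exists(destfile)) {
    # mode = "wb" writes the file in binary mode (important on Windows)
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Possible usage with the Gapminder file (not run here):
# fetch_file(url, "DataFiles/HIV.xlsx")
# HIV <- readxl::read_excel("DataFiles/HIV.xlsx")
```

Skipping the download when the file already exists keeps the analysis reproducible offline, at the cost of not picking up later changes to the source file.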
#required libraries
library(dplyr)
library(tidyr)
library(stringr)
library(readxl)
library(ggplot2)
library(ggrepel)
url <- "https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx"
download.file(url, "DataFiles/HIV.xlsx", mode = "wb") # "wb" keeps the binary file intact on Windows
HIV <- read_excel("DataFiles/HIV.xlsx")
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame': 275 obs. of 34 variables:
## $ Estimated HIV Prevalence% - (Ages 15-49): chr "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
## $ 1979.0 : num NA NA NA NA NA ...
## $ 1980.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1981.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1982.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1983.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1984.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1985.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1986.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1987.0 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1988.0 : logi NA NA NA NA NA NA ...
## $ 1989.0 : logi NA NA NA NA NA NA ...
## $ 1990.0 : num NA NA NA NA 0.06 NA NA 0.5 NA NA ...
## $ 1991.0 : num NA NA NA NA 0.06 NA NA 0.8 NA NA ...
## $ 1992.0 : num NA NA NA NA 0.06 NA NA 1 NA NA ...
## $ 1993.0 : num NA NA NA NA 0.06 NA NA 1.2 NA NA ...
## $ 1994.0 : num NA NA NA NA 0.06 NA NA 1.4 NA NA ...
## $ 1995.0 : num NA NA NA NA 0.06 NA NA 1.6 NA NA ...
## $ 1996.0 : num NA NA NA NA 0.06 NA NA 1.7 NA NA ...
## $ 1997.0 : num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ 1998.0 : num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ 1999.0 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2000.0 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2001.0 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2002.0 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2003.0 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2004.0 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2005.0 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2006.0 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2007.0 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2008.0 : num NA NA NA NA 0.1 NA NA 2 NA NA ...
## $ 2009 : chr NA "0.06" NA NA ...
## $ 2010 : chr NA "0.06" NA NA ...
## $ 2011 : chr NA "0.06" NA NA ...
head(HIV)
## # A tibble: 6 x 34
## `Estimated HIV … `1979.0` `1980.0` `1981.0` `1982.0` `1983.0` `1984.0`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhazia NA NA NA NA NA NA
## 2 Afghanistan NA NA NA NA NA NA
## 3 Akrotiri and Dh… NA NA NA NA NA NA
## 4 Albania NA NA NA NA NA NA
## 5 Algeria NA NA NA NA NA NA
## 6 American Samoa NA NA NA NA NA NA
## # ... with 27 more variables: `1985.0` <dbl>, `1986.0` <dbl>,
## # `1987.0` <dbl>, `1988.0` <lgl>, `1989.0` <lgl>, `1990.0` <dbl>,
## # `1991.0` <dbl>, `1992.0` <dbl>, `1993.0` <dbl>, `1994.0` <dbl>,
## # `1995.0` <dbl>, `1996.0` <dbl>, `1997.0` <dbl>, `1998.0` <dbl>,
## # `1999.0` <dbl>, `2000.0` <dbl>, `2001.0` <dbl>, `2002.0` <dbl>,
## # `2003.0` <dbl>, `2004.0` <dbl>, `2005.0` <dbl>, `2006.0` <dbl>,
## # `2007.0` <dbl>, `2008.0` <dbl>, `2009` <chr>, `2010` <chr>,
## # `2011` <chr>
names(HIV)
## [1] "Estimated HIV Prevalence% - (Ages 15-49)"
## [2] "1979.0"
## [3] "1980.0"
## [4] "1981.0"
## [5] "1982.0"
## [6] "1983.0"
## [7] "1984.0"
## [8] "1985.0"
## [9] "1986.0"
## [10] "1987.0"
## [11] "1988.0"
## [12] "1989.0"
## [13] "1990.0"
## [14] "1991.0"
## [15] "1992.0"
## [16] "1993.0"
## [17] "1994.0"
## [18] "1995.0"
## [19] "1996.0"
## [20] "1997.0"
## [21] "1998.0"
## [22] "1999.0"
## [23] "2000.0"
## [24] "2001.0"
## [25] "2002.0"
## [26] "2003.0"
## [27] "2004.0"
## [28] "2005.0"
## [29] "2006.0"
## [30] "2007.0"
## [31] "2008.0"
## [32] "2009"
## [33] "2010"
## [34] "2011"
The name of the column with country names contains the title of the worksheet. The other columns contain the prevalence of HIV by year, but their names are inconsistent: most carry a spurious .0 suffix (for example 1979.0) while the last three do not. It is best to skip the first row containing column names and assign these in R.
HIV <- read_excel("DataFiles/HIV.xlsx", skip = 1, col_names = FALSE)
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame': 275 obs. of 34 variables:
## $ X__1 : chr "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
## $ X__2 : num NA NA NA NA NA ...
## $ X__3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__5 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__6 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__7 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__8 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__9 : num NA NA NA NA NA NA NA NA NA NA ...
## $ X__10: num NA NA NA NA NA NA NA NA NA NA ...
## $ X__11: logi NA NA NA NA NA NA ...
## $ X__12: logi NA NA NA NA NA NA ...
## $ X__13: num NA NA NA NA 0.06 NA NA 0.5 NA NA ...
## $ X__14: num NA NA NA NA 0.06 NA NA 0.8 NA NA ...
## $ X__15: num NA NA NA NA 0.06 NA NA 1 NA NA ...
## $ X__16: num NA NA NA NA 0.06 NA NA 1.2 NA NA ...
## $ X__17: num NA NA NA NA 0.06 NA NA 1.4 NA NA ...
## $ X__18: num NA NA NA NA 0.06 NA NA 1.6 NA NA ...
## $ X__19: num NA NA NA NA 0.06 NA NA 1.7 NA NA ...
## $ X__20: num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ X__21: num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ X__22: num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ X__23: num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ X__24: num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ X__25: num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ X__26: num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ X__27: num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ X__28: num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ X__29: num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ X__30: num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ X__31: num NA NA NA NA 0.1 NA NA 2 NA NA ...
## $ X__32: chr NA "0.06" NA NA ...
## $ X__33: chr NA "0.06" NA NA ...
## $ X__34: chr NA "0.06" NA NA ...
head(HIV)
## # A tibble: 6 x 34
## X__1 X__2 X__3 X__4 X__5 X__6 X__7 X__8 X__9 X__10 X__11 X__12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 Abkha… NA NA NA NA NA NA NA NA NA NA NA
## 2 Afgha… NA NA NA NA NA NA NA NA NA NA NA
## 3 Akrot… NA NA NA NA NA NA NA NA NA NA NA
## 4 Alban… NA NA NA NA NA NA NA NA NA NA NA
## 5 Alger… NA NA NA NA NA NA NA NA NA NA NA
## 6 Ameri… NA NA NA NA NA NA NA NA NA NA NA
## # ... with 22 more variables: X__13 <dbl>, X__14 <dbl>, X__15 <dbl>,
## # X__16 <dbl>, X__17 <dbl>, X__18 <dbl>, X__19 <dbl>, X__20 <dbl>,
## # X__21 <dbl>, X__22 <dbl>, X__23 <dbl>, X__24 <dbl>, X__25 <dbl>,
## # X__26 <dbl>, X__27 <dbl>, X__28 <dbl>, X__29 <dbl>, X__30 <dbl>,
## # X__31 <dbl>, X__32 <chr>, X__33 <chr>, X__34 <chr>
names(HIV)
## [1] "X__1" "X__2" "X__3" "X__4" "X__5" "X__6" "X__7" "X__8"
## [9] "X__9" "X__10" "X__11" "X__12" "X__13" "X__14" "X__15" "X__16"
## [17] "X__17" "X__18" "X__19" "X__20" "X__21" "X__22" "X__23" "X__24"
## [25] "X__25" "X__26" "X__27" "X__28" "X__29" "X__30" "X__31" "X__32"
## [33] "X__33" "X__34"
aux <- seq(1979, 2011, 1)
names(HIV) <- c("Country", as.character(aux))
head(HIV)
## # A tibble: 6 x 34
## Country `1979` `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhaz… NA NA NA NA NA NA NA NA NA
## 2 Afghan… NA NA NA NA NA NA NA NA NA
## 3 Akroti… NA NA NA NA NA NA NA NA NA
## 4 Albania NA NA NA NA NA NA NA NA NA
## 5 Algeria NA NA NA NA NA NA NA NA NA
## 6 Americ… NA NA NA NA NA NA NA NA NA
## # ... with 24 more variables: `1988` <lgl>, `1989` <lgl>, `1990` <dbl>,
## # `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
## # `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
## # `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <chr>, `2010` <chr>,
## # `2011` <chr>
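The renaming above relies on knowing the range of years in advance. As an alternative sketch (not what these notes do), the original header row could be kept and cleaned with a regular expression. The vector old_names below is a shortened copy of the names printed earlier.

```r
# shortened copy of the column names returned by the first read_excel() call
old_names <- c("Estimated HIV Prevalence% - (Ages 15-49)",
               "1979.0", "1980.0", "2009", "2010")

# drop a trailing ".0" where present; names without it are left untouched
new_names <- sub("\\.0$", "", old_names)

# the first column holds the country names
new_names[1] <- "Country"
new_names
```

This approach does not depend on the years being consecutive, which may be safer if the source file changes.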
The last three columns (2009 to 2011) have been read as character, and the columns corresponding to 1988 and 1989 are of class logical because all their entries are NAs.
Let us coerce the last three columns to numeric mode.
HIV <- HIV %>%
mutate_at(32:34, as.numeric)
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame': 275 obs. of 34 variables:
## $ Country: chr "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
## $ 1979 : num NA NA NA NA NA ...
## $ 1980 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1981 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1982 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1983 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1984 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1985 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1986 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1987 : num NA NA NA NA NA NA NA NA NA NA ...
## $ 1988 : logi NA NA NA NA NA NA ...
## $ 1989 : logi NA NA NA NA NA NA ...
## $ 1990 : num NA NA NA NA 0.06 NA NA 0.5 NA NA ...
## $ 1991 : num NA NA NA NA 0.06 NA NA 0.8 NA NA ...
## $ 1992 : num NA NA NA NA 0.06 NA NA 1 NA NA ...
## $ 1993 : num NA NA NA NA 0.06 NA NA 1.2 NA NA ...
## $ 1994 : num NA NA NA NA 0.06 NA NA 1.4 NA NA ...
## $ 1995 : num NA NA NA NA 0.06 NA NA 1.6 NA NA ...
## $ 1996 : num NA NA NA NA 0.06 NA NA 1.7 NA NA ...
## $ 1997 : num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ 1998 : num NA NA NA NA 0.06 NA NA 1.8 NA NA ...
## $ 1999 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2000 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2001 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2002 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2003 : num NA NA NA NA 0.06 NA NA 1.9 NA NA ...
## $ 2004 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2005 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2006 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2007 : num NA NA NA NA 0.1 NA NA 1.9 NA NA ...
## $ 2008 : num NA NA NA NA 0.1 NA NA 2 NA NA ...
## $ 2009 : num NA 0.06 NA NA NA NA NA 2.1 NA NA ...
## $ 2010 : num NA 0.06 NA NA NA NA NA 2.1 NA NA ...
## $ 2011 : num NA 0.06 NA NA NA NA NA 2.1 NA NA ...
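Coercing text to numbers silently turns any entry that is not a valid number into NA, so it can be worth counting how many entries were lost. The sketch below uses a made-up miniature table (the "n/a" entry is invented) and selects the columns by name with across(), which assumes dplyr 1.0 or later, rather than by position.

```r
library(dplyr)

# made-up miniature of the HIV table; "n/a" is an invented bad entry
toy <- tibble(
  Country = c("A", "B", "C"),
  `2009`  = c(NA, "0.06", "2.1"),
  `2010`  = c(NA, "0.06", "n/a")
)

# convert the columns selected by name; warnings are silenced on purpose
toy_num <- toy %>%
  mutate(across(c(`2009`, `2010`), ~ suppressWarnings(as.numeric(.x))))

# entries that were present as text but became NA after the conversion
cols <- c("2009", "2010")
lost <- sum(!is.na(toy[, cols]) & is.na(toy_num[, cols]))
lost
```

A count of zero on the real data would suggest that the character columns contained only numbers and missing values.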
The columns before 1990 are mostly NAs and so we will remove them from the data set.
#keep only the country column and columns 13 to 34 (years 1990 to 2011)
HIV <- select(HIV, c(1, 13:34))
HIV
## # A tibble: 275 x 23
## Country `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997` `1998`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhaz… NA NA NA NA NA NA NA NA NA
## 2 Afghan… NA NA NA NA NA NA NA NA NA
## 3 Akroti… NA NA NA NA NA NA NA NA NA
## 4 Albania NA NA NA NA NA NA NA NA NA
## 5 Algeria 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
## 6 Americ… NA NA NA NA NA NA NA NA NA
## 7 Andorra NA NA NA NA NA NA NA NA NA
## 8 Angola 0.5 0.8 1 1.2 1.4 1.6 1.7 1.8 1.8
## 9 Anguil… NA NA NA NA NA NA NA NA NA
## 10 Antigu… NA NA NA NA NA NA NA NA NA
## # ... with 265 more rows, and 13 more variables: `1999` <dbl>,
## # `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## # `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>
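The statement that a column is mostly NAs can be checked by computing the share of missing values per column. A minimal sketch on made-up data (the column names mimic the real ones, the values do not come from Gapminder):

```r
# made-up table with one empty year and one partly filled year
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  `1989`  = c(NA, NA, NA, NA),
  `1990`  = c(NA, 0.06, 0.5, NA),
  check.names = FALSE   # keep the year names as they are
)

# is.na() returns a logical matrix; column means give the share of NAs
na_share <- colMeans(is.na(toy))
na_share
```

Applied to the full HIV table before the selection, the same call would show which years could be dropped.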
Now, let us tidy the data ready for analysis.
An observational unit is a country and the variables are year and prevalence of HIV. So, the tidy version of the data has three columns: country, year and prevalence.
Let us gather the columns with a year number into one single column named Year and put the corresponding values of prevalence under a column named PrevalenceHIV.
HIV2 <- gather(HIV, "Year", "PrevalenceHIV", -Country)
glimpse(HIV2)
## Observations: 6,050
## Variables: 3
## $ Country <chr> "Abkhazia", "Afghanistan", "Akrotiri and Dhekeli...
## $ Year <chr> "1990", "1990", "1990", "1990", "1990", "1990", ...
## $ PrevalenceHIV <dbl> NA, NA, NA, NA, 0.06, NA, NA, 0.50, NA, NA, 0.30...
The data is tidy.
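In tidyr 1.0 and later, pivot_longer() is available as the successor of gather(). The sketch below reshapes two rows copied from the table above and also converts Year to integer, which these notes do not do (Year stays character here). Note that pivot_longer() orders the result by row of the wide table, whereas gather() orders it by column.

```r
library(dplyr)
library(tidyr)

# two rows and two years copied from the HIV table above
toy <- tibble(
  Country = c("Algeria", "Angola"),
  `1990`  = c(0.06, 0.5),
  `1991`  = c(0.06, 0.8)
)

toy_long <- toy %>%
  pivot_longer(-Country, names_to = "Year", values_to = "PrevalenceHIV") %>%
  mutate(Year = as.integer(Year))   # optional: store the year as a number
toy_long
```

If Year is converted in one table, it must be converted in every table it is later joined with, since the join keys need compatible types.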
Let us visualise some of the data.
We will visualise using the concepts and additional data in Gapminder.org.
The HIV prevalence data will be plotted vs. Income (GDP per capita, PPP$ inflation-adjusted). The income data in Gapminder is in Excel format at the URL https://docs.google.com/spreadsheets/d/1PybxH399kK6OjJI4T2M33UsLqgutwj3SuYbk7Yt6sxE/pub. The data has already been downloaded and is in the file “gdp_per_capita_ppp.xlsx” in the DataFiles directory. Note how we are going back to the beginning of the data analysis process in order to make our data exploration more meaningful.
income <- read_excel("DataFiles/gdp_per_capita_ppp.xlsx")
head(income)
## # A tibble: 6 x 217
## `GDP per capita` `1800.0` `1801.0` `1802.0` `1803.0` `1804.0` `1805.0`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abkhazia NA NA NA NA NA NA
## 2 Afghanistan 603 603 603 603 603 603
## 3 Akrotiri and Dh… NA NA NA NA NA NA
## 4 Albania 667 667 668 668 668 668
## 5 Algeria 716 716 717 718 719 720
## 6 American Samoa NA NA NA NA NA NA
## # ... with 210 more variables: `1806.0` <dbl>, `1807.0` <dbl>,
## # `1808.0` <dbl>, `1809.0` <dbl>, `1810.0` <dbl>, `1811.0` <dbl>,
## # `1812.0` <dbl>, `1813.0` <dbl>, `1814.0` <dbl>, `1815.0` <dbl>,
## # `1816.0` <dbl>, `1817.0` <dbl>, `1818.0` <dbl>, `1819.0` <dbl>,
## # `1820.0` <dbl>, `1821.0` <dbl>, `1822.0` <dbl>, `1823.0` <dbl>,
## # `1824.0` <dbl>, `1825.0` <dbl>, `1826.0` <dbl>, `1827.0` <dbl>,
## # `1828.0` <dbl>, `1829.0` <dbl>, `1830.0` <dbl>, `1831.0` <dbl>,
## # `1832.0` <dbl>, `1833.0` <dbl>, `1834.0` <dbl>, `1835.0` <dbl>,
## # `1836.0` <dbl>, `1837.0` <dbl>, `1838.0` <dbl>, `1839.0` <dbl>,
## # `1840.0` <dbl>, `1841.0` <dbl>, `1842.0` <dbl>, `1843.0` <dbl>,
## # `1844.0` <dbl>, `1845.0` <dbl>, `1846.0` <dbl>, `1847.0` <dbl>,
## # `1848.0` <dbl>, `1849.0` <dbl>, `1850.0` <dbl>, `1851.0` <dbl>,
## # `1852.0` <dbl>, `1853.0` <dbl>, `1854.0` <dbl>, `1855.0` <dbl>,
## # `1856.0` <dbl>, `1857.0` <dbl>, `1858.0` <dbl>, `1859.0` <dbl>,
## # `1860.0` <dbl>, `1861.0` <dbl>, `1862.0` <dbl>, `1863.0` <dbl>,
## # `1864.0` <dbl>, `1865.0` <dbl>, `1866.0` <dbl>, `1867.0` <dbl>,
## # `1868.0` <dbl>, `1869.0` <dbl>, `1870.0` <dbl>, `1871.0` <dbl>,
## # `1872.0` <dbl>, `1873.0` <dbl>, `1874.0` <dbl>, `1875.0` <dbl>,
## # `1876.0` <dbl>, `1877.0` <dbl>, `1878.0` <dbl>, `1879.0` <dbl>,
## # `1880.0` <dbl>, `1881.0` <dbl>, `1882.0` <dbl>, `1883.0` <dbl>,
## # `1884.0` <dbl>, `1885.0` <dbl>, `1886.0` <dbl>, `1887.0` <dbl>,
## # `1888.0` <dbl>, `1889.0` <dbl>, `1890.0` <dbl>, `1891.0` <dbl>,
## # `1892.0` <dbl>, `1893.0` <dbl>, `1894.0` <dbl>, `1895.0` <dbl>,
## # `1896.0` <dbl>, `1897.0` <dbl>, `1898.0` <dbl>, `1899.0` <dbl>,
## # `1900.0` <dbl>, `1901.0` <dbl>, `1902.0` <dbl>, `1903.0` <dbl>,
## # `1904.0` <dbl>, `1905.0` <dbl>, …
Note how the column with country names has been named GDP per capita and the year column names have a .0 format. Let us change the column names and, as usual, gather columns with year value names in one single column and create a column with income data for each country and year.
names(income) <- c("Country", as.character(seq(1800, 2015, 1)))
income2 <- gather(income, "Year", "Income", -Country)
glimpse(income2)
## Observations: 56,592
## Variables: 3
## $ Country <chr> "Abkhazia", "Afghanistan", "Akrotiri and Dhekelia", "A...
## $ Year <chr> "1800", "1800", "1800", "1800", "1800", "1800", "1800"...
## $ Income <dbl> NA, 603, NA, 667, 716, NA, 1197, 618, NA, 757, 1507, 5...
This looks better!
Do HIV2 and income2 contain the same countries?
nrow(HIV2)
## [1] 6050
nrow(income2)
## [1] 56592
We need to combine the HIV2 and income2 data sets but, from the results above, we observe that they have a different number of rows. Part of the difference is due to the years covered (22 in HIV2, 216 in income2), but the number of countries also differs: 6050/22 = 275 in HIV2 and 56592/216 = 262 in income2. So, we must intersect both data sets, i.e. merge them leaving out the data for countries which are not in both data sets. We use the function inner_join() (package dplyr) which, by default, will do a natural join, using all variables with common names across the two tables (in this case Country and Year).
HIV_Inc <- inner_join(HIV2, income2)#merging HIV2 and income2
## Joining, by = c("Country", "Year")
HIV_Inc
## # A tibble: 5,720 x 4
## Country Year PrevalenceHIV Income
## <chr> <chr> <dbl> <dbl>
## 1 Abkhazia 1990 NA NA
## 2 Afghanistan 1990 NA 1028
## 3 Akrotiri and Dhekelia 1990 NA NA
## 4 Albania 1990 NA 4350
## 5 Algeria 1990 0.06 10113
## 6 American Samoa 1990 NA NA
## 7 Andorra 1990 NA 28417
## 8 Angola 1990 0.5 4232
## 9 Anguilla 1990 NA NA
## 10 Antigua and Barbuda 1990 NA 17154
## # ... with 5,710 more rows
Checking that we got the right number of rows in HIV_Inc:
aux <- intersect(HIV$Country, income$Country) # this vector contains the common countries in HIV and income
length(aux) * 22 # we are considering 22 years per country so this number should be equal to the number of rows in HIV_Inc
## [1] 5720
nrow(HIV_Inc) # the number of rows in HIV_Inc
## [1] 5720
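Besides counting rows, it can help to see which keys are left out by the inner join. A minimal sketch with made-up tables (the country labels A to D are placeholders): anti_join() returns the rows of the first table with no match in the second, and the by argument states the join keys explicitly instead of relying on the natural join.

```r
library(dplyr)

# made-up tables; only A and B appear in both
hiv_toy <- tibble(Country = c("A", "B", "C"), Year = "1990",
                  PrevalenceHIV = c(0.06, 0.5, NA))
inc_toy <- tibble(Country = c("A", "B", "D"), Year = "1990",
                  Income = c(10113, 4232, 28417))

# rows present in both tables
both <- inner_join(hiv_toy, inc_toy, by = c("Country", "Year"))

# rows dropped by the inner join, from each side
only_hiv <- anti_join(hiv_toy, inc_toy, by = c("Country", "Year"))
only_inc <- anti_join(inc_toy, hiv_toy, by = c("Country", "Year"))

both
only_hiv
only_inc
```

On the real data, anti_join(HIV2, income2, by = c("Country", "Year")) would list the country and year combinations of HIV2 that have no income data.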
To add more interest to our visualisation, we add region (continent, sub-continent) information downloaded from https://www.gapminder.org/data/geo/ into the file “DataGeographiesGapminder.xlsx”. This is a workbook with many sheets. The second sheet is the one that contains the list of country names, different region denominations and other geographical information.
continent <- read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = 2)# read only the second sheet
head(continent)
## # A tibble: 6 x 11
## geo name four_regions eight_regions six_regions members_oecd_g77
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 afg Afgh… asia asia_west south_asia g77
## 2 alb Alba… europe europe_east europe_cen… others
## 3 dza Alge… africa africa_north middle_eas… g77
## 4 and Ando… europe europe_west europe_cen… others
## 5 ago Ango… africa africa_sub_s… sub_sahara… g77
## 6 atg Anti… americas america_north america g77
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## # since` <dttm>, `World bank region` <chr>, `World bank income group
## # 2017` <chr>
continent <- rename(continent, Country = name)
glimpse(continent)
## Observations: 197
## Variables: 11
## $ geo <chr> "afg", "alb", "dza", "and", "ag...
## $ Country <chr> "Afghanistan", "Albania", "Alge...
## $ four_regions <chr> "asia", "europe", "africa", "eu...
## $ eight_regions <chr> "asia_west", "europe_east", "af...
## $ six_regions <chr> "south_asia", "europe_central_a...
## $ members_oecd_g77 <chr> "g77", "others", "g77", "others...
## $ Latitude <dbl> 33.00000, 41.00000, 28.00000, 4...
## $ Longitude <dbl> 66.00000, 20.00000, 3.00000, 1....
## $ `UN member since` <dttm> 1946-11-19, 1955-12-14, 1962-1...
## $ `World bank region` <chr> "South Asia", "Europe & Central...
## $ `World bank income group 2017` <chr> "Low income", "Upper middle inc...
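When a workbook has many sheets, the sheet names can be listed first and the sheet read by name rather than by position. The sketch below uses the example workbook datasets.xlsx that ships with readxl, not the Gapminder file, so the sheet names shown are those of that example.

```r
library(readxl)

# example workbook bundled with readxl (not the Gapminder workbook)
path <- readxl_example("datasets.xlsx")

# list the sheet names before deciding which one to read
sheets <- excel_sheets(path)
sheets

# read a sheet by name instead of by position
mt <- read_excel(path, sheet = "mtcars")
dim(mt)
```

Reading by name keeps the code working if sheets are reordered in a later version of the file.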
Next we merge this information with HIV and income data
HIV_Inc_Cont <- inner_join(HIV_Inc, continent)
## Joining, by = "Country"
glimpse(HIV_Inc_Cont)
## Observations: 4,312
## Variables: 14
## $ Country <chr> "Afghanistan", "Albania", "Alge...
## $ Year <chr> "1990", "1990", "1990", "1990",...
## $ PrevalenceHIV <dbl> NA, NA, 0.06, NA, 0.50, NA, 0.3...
## $ Income <dbl> 1028, 4350, 10113, 28417, 4232,...
## $ geo <chr> "afg", "alb", "dza", "and", "ag...
## $ four_regions <chr> "asia", "europe", "africa", "eu...
## $ eight_regions <chr> "asia_west", "europe_east", "af...
## $ six_regions <chr> "south_asia", "europe_central_a...
## $ members_oecd_g77 <chr> "g77", "others", "g77", "others...
## $ Latitude <dbl> 33.00000, 41.00000, 28.00000, 4...
## $ Longitude <dbl> 66.00000, 20.00000, 3.00000, 1....
## $ `UN member since` <dttm> 1946-11-19, 1955-12-14, 1962-1...
## $ `World bank region` <chr> "South Asia", "Europe & Central...
## $ `World bank income group 2017` <chr> "Low income", "Upper middle inc...
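The join above went from 5,720 to 4,312 rows, so some countries in HIV_Inc have no match in continent. One way to list them is a left join followed by a filter on the missing region. A minimal sketch with made-up tables (the labels A and B are placeholders):

```r
library(dplyr)

# made-up data: country B has no entry in the geography table
data_toy <- tibble(Country = c("A", "A", "B"),
                   Year    = c("1990", "1991", "1990"),
                   Income  = c(1, 2, 3))
geo_toy  <- tibble(Country = "A", four_regions = "africa")

# a left join keeps every row of data_toy; unmatched rows get NA for the region
unmatched <- data_toy %>%
  left_join(geo_toy, by = "Country") %>%
  filter(is.na(four_regions)) %>%
  distinct(Country)
unmatched
```

Countries listed this way may be genuinely absent from the geography table or may simply be spelled differently in the two sources.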
The plan is to plot prevalence vs. income distinguishing with colours by continent and having 5 parallel plots, one for each of years 1990, 1995, 2000, 2005, 2011. The filter() function of the dplyr package serves to subset the data according to a logical criterion.
aux <- filter(HIV_Inc_Cont, Year %in% c("1990", "1995", "2000", "2005", "2011"))
aux %>%
ggplot(aes(x = Income, y = PrevalenceHIV, col = four_regions) ) +
geom_point(alpha=0.8) +
labs(x = "GDP per capita ($) - inflation adjusted" ) +
labs(y = "Estimated HIV prevalence (%)" ) +
ggtitle("Plot of HIV prevalence vs income - all nations") +
facet_grid(.~Year) + # one plot for each of the desired years
theme(legend.position = "bottom")
## Warning: Removed 249 rows containing missing values (geom_point).
As we can see in the plots above and below, most African countries have prevalence values on a scale which is about ten times that of the rest of the world. This makes the visualisation difficult and we will visualise the data for African countries separately.
ggplot(HIV_Inc_Cont, aes(x = four_regions, y = PrevalenceHIV)) +
geom_boxplot()
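The difference in scale between regions can also be expressed in numbers with a grouped summary. The sketch below uses made-up values, so the medians are only illustrative.

```r
library(dplyr)

# made-up prevalence values by region
toy <- tibble(
  four_regions  = c("africa", "africa", "africa", "europe", "europe", "asia"),
  PrevalenceHIV = c(10, 2, NA, 0.1, 0.3, 0.2)
)

# median prevalence and number of missing values per region
by_region <- toy %>%
  group_by(four_regions) %>%
  summarise(median_prev = median(PrevalenceHIV, na.rm = TRUE),
            n_missing   = sum(is.na(PrevalenceHIV)))
by_region
```

With the real data, HIV_Inc_Cont would take the place of toy.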
#Only Africa
aux2 <- filter(aux, four_regions == "africa")# we further filter the data to select only countries in Africa
p_africa <- ggplot(aux2, aes(x = Income, y = PrevalenceHIV) ) +
geom_point(alpha = 0.8, color = "green", show.legend = FALSE) +
labs(x = "GDP per capita ($) - inflation adjusted" ) +
labs(y = "Estimated HIV prevalence (%)" ) +
ggtitle("Plot of HIV prevalence vs income - Africa") +
facet_grid(.~Year)
p_africa
To gain more insight, let us identify the African countries with HIV prevalence greater than or equal to 10%.
#for year 1990
x_90 <- filter(aux2, PrevalenceHIV >= 10 & Year == "1990") #further filter the data selecting prevalence>=10 and year 1990
select(x_90, 1:5)
## # A tibble: 3 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Uganda 1990 10.2 767 uga
## 2 Zambia 1990 12.7 2407 zmb
## 3 Zimbabwe 1990 10.1 2532 zwe
#for year 1995
x_95 <- filter(aux2, PrevalenceHIV >= 10 & Year == "1995")
select(x_95, 1:5)
## # A tibble: 7 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Botswana 1995 16.6 8823 bwa
## 2 Kenya 1995 10.3 2199 ken
## 3 Lesotho 1995 14.3 1466 lso
## 4 Malawi 1995 13.9 593 mwi
## 5 Swaziland 1995 10.6 5043 swz
## 6 Zambia 1995 15 2106 zmb
## 7 Zimbabwe 1995 25.1 2416 zwe
#for year 2000
x_00 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2000")
select(x_00, 1:5)
## # A tibble: 8 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Botswana 2000 26 10250 bwa
## 2 Lesotho 2000 24.5 1629 lso
## 3 Malawi 2000 14.2 632 mwi
## 4 Namibia 2000 15.3 6111 nam
## 5 South Africa 2000 16.1 9927 zaf
## 6 Swaziland 2000 22.3 5257 swz
## 7 Zambia 2000 14.4 2202 zmb
## 8 Zimbabwe 2000 24.8 2521 zwe
#for year 2005
x_05 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2005")
select(x_05, 1:5)
## # A tibble: 9 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Botswana 2005 25.5 11460 bwa
## 2 Lesotho 2005 23.6 1810 lso
## 3 Malawi 2005 12.1 609 mwi
## 4 Mozambique 2005 11.2 774 moz
## 5 Namibia 2005 15.7 7279 nam
## 6 South Africa 2005 18.1 11133 zaf
## 7 Swaziland 2005 25.6 5618 swz
## 8 Zambia 2005 13.9 2620 zmb
## 9 Zimbabwe 2005 18.4 1689 zwe
#for year 2011
x_11 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2011")
select(x_11, 1:5)
## # A tibble: 9 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Botswana 2011 23.4 14341 bwa
## 2 Lesotho 2011 23.3 2301 lso
## 3 Malawi 2011 10 747 mwi
## 4 Mozambique 2011 11.3 974 moz
## 5 Namibia 2011 13.4 8715 nam
## 6 South Africa 2011 17.3 12291 zaf
## 7 Swaziland 2011 26 5846 swz
## 8 Zambia 2011 12.5 3557 zmb
## 9 Zimbabwe 2011 14.9 1626 zwe
library(ggrepel)
To add country name labels to the plots we use the function geom_text_repel() in the package ggrepel. The most important feature of ggrepel is that it keeps the labels from overlapping when the points they identify are very close to each other.
#Let us add the names of the countries with high HIV prevalence to the plots.
p_africa <- p_africa +
geom_text_repel(data = x_90, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_95, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_00, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_05, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_11, aes(label = geo) , col = "black", size = 3)
p_africa
Note that countries with missing data are not in the plot.
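Since Year is a column of the data, the five per-year filters can in principle be collapsed into a single one: a label layer whose data contains Year should be split across the facets like the points are. A minimal sketch of the filtering step on a few rows in the style of the tables above (the NA for Kenya in 1990 is a placeholder, and the name x_high in the comments is hypothetical):

```r
library(dplyr)

# a few rows in the style of the tables above; the NA is a placeholder
toy <- tibble(
  Country       = c("Uganda", "Zambia", "Kenya", "Kenya"),
  Year          = c("1990", "1990", "1990", "1995"),
  PrevalenceHIV = c(10.2, 12.7, NA, 10.3),
  geo           = c("uga", "zmb", "ken", "ken")
)

# one filter for all years; filter() drops rows where the condition is NA
high <- filter(toy, PrevalenceHIV >= 10)

# number of labelled countries per year
count(high, Year)

# With the real data this might become (an assumption, not run here):
# x_high <- filter(aux2, PrevalenceHIV >= 10)
# p_africa + geom_text_repel(data = x_high, aes(label = geo), col = "black", size = 3)
```

The per-year tables remain useful for inspection, but a single label layer keeps the plotting code shorter.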
Which countries are getting richer? Is that reflected in the HIV prevalence?
#for year 1990
x_90 <- filter(aux2, Income >= 15000 & Year == "1990") # select african countries data for year 1990 and income>=15000
x_90
## # A tibble: 2 x 14
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Gabon 1990 0.9 19358 gab africa africa_sub_s…
## 2 Libya 1990 NA 26928 lby africa africa_north
## # ... with 7 more variables: six_regions <chr>, members_oecd_g77 <chr>,
## # Latitude <dbl>, Longitude <dbl>, `UN member since` <dttm>, `World bank
## # region` <chr>, `World bank income group 2017` <chr>
#for year 1995
x_95 <- filter(aux2, Income >= 15000 & Year == "1995")
select(x_95, 1:5)
## # A tibble: 3 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Gabon 1995 3.1 19738 gab
## 2 Libya 1995 NA 23363 lby
## 3 Seychelles 1995 NA 15097 syc
#for year 2000
x_00 <- filter(aux2, Income >= 15000 & Year == "2000")
select(x_00, 1:5)
## # A tibble: 3 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Gabon 2000 5.2 17630 gab
## 2 Libya 2000 NA 22682 lby
## 3 Seychelles 2000 NA 18453 syc
#for year 2005
x_05 <- filter(aux2, Income >= 15000 & Year == "2005")
select(x_05, 1:5)
## # A tibble: 4 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Equatorial Guinea 2005 3.6 36200 gnq
## 2 Gabon 2005 5.4 17069 gab
## 3 Libya 2005 NA 26967 lby
## 4 Seychelles 2005 NA 17803 syc
#for year 2011
x_11 <- filter(aux2, Income >= 15000 & Year == "2011")
select(x_11, 1:5)
## # A tibble: 4 x 5
## Country Year PrevalenceHIV Income geo
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Equatorial Guinea 2011 4.7 35150 gnq
## 2 Gabon 2011 5 16590 gab
## 3 Mauritius 2011 1 16179 mus
## 4 Seychelles 2011 NA 22556 syc
#Let us add the names of the countries with high income to the plots.
p_africa <- p_africa +
geom_text_repel(data = x_90, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_95, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_00, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_05, aes(label = geo) , col = "black", size = 3) +
geom_text_repel(data = x_11, aes(label = geo) , col = "black", size = 3)
p_africa
library(plotly)
ggplotly(p_africa)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
## If you'd like to see this geom implemented,
## Please open an issue with your example code at
## https://github.com/ropensci/plotly/issues
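As the warning says, the ggrepel labels are not carried over to the interactive plot. One possible workaround is to show the country name in the hover tooltip instead. The sketch below maps Country to a text aesthetic and passes it to the tooltip argument of ggplotly(). The two rows are copied from the 2005 table above, and p_toy is a stand-in for p_africa.

```r
library(ggplot2)
library(plotly)

# two rows taken from the 2005 high-income table above
toy <- data.frame(
  Country       = c("Equatorial Guinea", "Gabon"),
  Income        = c(36200, 17069),
  PrevalenceHIV = c(3.6, 5.4)
)

# 'text' is not used by geom_point(); it is carried along for ggplotly()
p_toy <- ggplot(toy, aes(x = Income, y = PrevalenceHIV, text = Country)) +
  geom_point(color = "green")

# show the country name together with the x and y values on hover
w <- ggplotly(p_toy, tooltip = c("text", "x", "y"))
# print w to display the interactive plot
```

This avoids the unimplemented geom altogether, although the labels are then visible only when hovering over a point.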
Similar analysis for the rest of the continents
aux2_r <- filter(aux, four_regions != "africa")
p_rest <- ggplot(aux2_r, aes(x = Income, y = PrevalenceHIV, col = four_regions) ) +
geom_point(alpha=0.8) +
labs(x = "GDP per capita ($) - inflation adjusted" ) +
labs(y = "Estimated HIV prevalence (%)" ) +
ggtitle("Plot of HIV prevalence vs income - Americas, Asia and Europe") +
facet_grid(.~Year) +
theme(legend.position = "bottom")
p_rest
Let us identify the countries with HIV prevalence greater than or equal to 1%.
#for year 1990
x_90 <- filter(aux2_r, PrevalenceHIV >= 1, Year == "1990")
x_90
## # A tibble: 6 x 14
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Bahamas 1990 3.6 24281 bhs americas america_north
## 2 Guyana 1990 2.8 3231 guy americas america_south
## 3 Haiti 1990 1.3 2242 hti americas america_north
## 4 Hondur… 1990 1.1 3205 hnd americas america_north
## 5 Jamaica 1990 2.1 7391 jam americas america_north
## 6 Thaila… 1990 1 6369 tha asia east_asia_pa…
## # ... with 7 more variables: six_regions <chr>, members_oecd_g77 <chr>,
## # Latitude <dbl>, Longitude <dbl>, `UN member since` <dttm>, `World bank
## # region` <chr>, `World bank income group 2017` <chr>
#for year 1995
x_95 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "1995")
select(x_95, 1:7)
## # A tibble: 9 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Bahamas 1995 3.7 22119 bhs americas america_north
## 2 Belize 1995 1.5 6209 blz americas america_north
## 3 Cambodia 1995 1.4 1091 khm asia east_asia_pacific
## 4 Guyana 1995 2.2 4533 guy americas america_south
## 5 Haiti 1995 3.6 1672 hti americas america_north
## 6 Honduras 1995 1.5 3344 hnd americas america_north
## 7 Jamaica 1995 2.2 8644 jam americas america_north
## 8 Panama 1995 1.6 8795 pan americas america_north
## 9 Thailand 1995 2.1 9239 tha asia east_asia_pacific
#for year 2000
x_00 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2000")
select(x_00, 1:7)
## # A tibble: 12 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Bahamas 2000 3.2 25858 bhs americas america_north
## 2 Belize 2000 2.2 7215 blz americas america_north
## 3 Cambodia 2000 1.3 1368 khm asia east_asia_pac…
## 4 Dominican… 2000 1 7955 dom americas america_north
## 5 Guyana 2000 1.5 5071 guy americas america_south
## 6 Haiti 2000 2.8 1734 hti americas america_north
## 7 Honduras 2000 1.3 3483 hnd americas america_north
## 8 Jamaica 2000 1.9 8139 jam americas america_north
## 9 Panama 2000 1.4 9954 pan americas america_north
## 10 Suriname 2000 1 9908 sur americas america_south
## 11 Thailand 2000 1.8 8939 tha asia east_asia_pac…
## 12 Trinidad … 2000 1.2 17721 tto americas america_north
#for year 2005
x_05 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2005")
select(x_05, 1:7)
## # A tibble: 11 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Bahamas 2005 3 25397 bhs americas america_north
## 2 Belize 2005 2.4 8202 blz americas america_north
## 3 Estonia 2005 1.1 21651 est europe europe_east
## 4 Guyana 2005 1.1 5140 guy americas america_south
## 5 Haiti 2005 2.1 1562 hti americas america_north
## 6 Jamaica 2005 1.8 8803 jam americas america_north
## 7 Panama 2005 1.1 11156 pan americas america_north
## 8 Suriname 2005 1.1 12225 sur americas america_south
## 9 Thailand 2005 1.5 10901 tha asia east_asia_pac…
## 10 Trinidad … 2005 1.3 25439 tto americas america_north
## 11 Ukraine 2005 1.1 7265 ukr europe europe_east
#for year 2011
x_11 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2011")
select(x_11, 1:7)
p_rest <- p_rest +
geom_text_repel(data = x_90, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_95, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_00, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_05, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_11, aes(label = geo, col = four_regions) , size = 3)
p_rest
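Because Year is the faceting variable, the five per-year data frames could also be built in a single step: one data frame passed to geom_text_repel() is split across the panels in the same way as the points. The following is only a sketch under assumptions: the tibble `toy` is hypothetical (its values are copied from the outputs in this section), the names `labelled` and `p_toy` are made up, and the x and y aesthetics are assumed to be Income and PrevalenceHIV.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Hypothetical toy data shaped like aux2_r (values copied from the outputs)
toy <- tibble(
  geo           = c("bhs", "lux", "tha", "bhs", "lux", "tha"),
  Year          = c("1990", "1990", "1990", "1995", "1995", "1995"),
  PrevalenceHIV = c(3.6, 0.1, 1, 3.7, 0.2, 2.1),
  Income        = c(24281, 56922, 6369, 22119, 64568, 9239),
  four_regions  = c("americas", "europe", "asia", "americas", "europe", "asia")
)

# One filter for all years of interest instead of one object per year
labelled <- filter(toy, PrevalenceHIV >= 1, Year %in% c("1990", "1995"))

# A single label layer; facet_grid() assigns each label to its own Year panel
p_toy <- ggplot(toy, aes(x = Income, y = PrevalenceHIV, col = four_regions)) +
  geom_point(alpha = 0.8) +
  facet_grid(. ~ Year) +
  geom_text_repel(data = labelled, aes(label = geo), size = 3) +
  theme(legend.position = "bottom")
p_toy
```

With the real data, the vector of years would be extended to all the years shown in the plot.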
Which countries are getting richer? Is that reflected in their HIV prevalence? We identify the countries with an income of at least $50,000 per capita in each year.
#for year 1990
x_90 <- filter(aux2_r, Income >= 50000 & Year == "1990")
select(x_90, 1:7)
## # A tibble: 4 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Brunei 1990 NA 77076 brn asia east_asia_pac…
## 2 Luxembourg 1990 0.1 56922 lux europe europe_west
## 3 Qatar 1990 0.06 73402 qat asia asia_west
## 4 United Ara… 1990 NA 114832 are asia asia_west
#for year 1995
x_95 <- filter(aux2_r, Income >= 50000 & Year == "1995")
select(x_95, 1:7)
## # A tibble: 6 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Brunei 1995 NA 78406 brn asia east_asia_pac…
## 2 Kuwait 1995 NA 82268 kwt asia asia_west
## 3 Luxembourg 1995 0.2 64568 lux europe europe_west
## 4 Norway 1995 0.1 50616 nor europe europe_west
## 5 Qatar 1995 0.06 77809 qat asia asia_west
## 6 United Ara… 1995 NA 106425 are asia asia_west
#for year 2000
x_00 <- filter(aux2_r, Income >= 50000 & Year == "2000")
select(x_00, 1:7)
## # A tibble: 9 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Brunei 2000 NA 74475 brn asia east_asia_pac…
## 2 Kuwait 2000 NA 75219 kwt asia asia_west
## 3 Luxembourg 2000 0.2 81425 lux europe europe_west
## 4 Monaco 2000 NA 50200 mco europe europe_west
## 5 Norway 2000 0.1 58699 nor europe europe_west
## 6 Qatar 2000 0.06 112238 qat asia asia_west
## 7 San Marino 2000 NA 51350 smr europe europe_west
## 8 Singapore 2000 0.1 51663 sgp asia east_asia_pac…
## 9 United Ara… 2000 NA 108048 are asia asia_west
#for year 2005
x_05 <- filter(aux2_r, Income >= 50000 & Year == "2005")
select(x_05, 1:7)
## # A tibble: 10 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Brunei 2005 NA 74441 brn asia east_asia_pa…
## 2 Kuwait 2005 NA 92665 kwt asia asia_west
## 3 Luxembourg 2005 0.3 88944 lux europe europe_west
## 4 Monaco 2005 NA 52761 mco europe europe_west
## 5 Norway 2005 0.1 63573 nor europe europe_west
## 6 Qatar 2005 0.06 119134 qat asia asia_west
## 7 San Marino 2005 NA 53928 smr europe europe_west
## 8 Singapore 2005 0.1 61921 sgp asia east_asia_pa…
## 9 Switzerland 2005 0.3 51069 che europe europe_west
## 10 United Ara… 2005 NA 102324 are asia asia_west
#for year 2011
x_11 <- filter(aux2_r, Income >= 50000 & Year == "2011")
select(x_11, 1:7)
## # A tibble: 10 x 7
## Country Year PrevalenceHIV Income geo four_regions eight_regions
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Brunei 2011 NA 71991 brn asia east_asia_pa…
## 2 Hong Kong,… 2011 NA 50086 hkg asia east_asia_pa…
## 3 Kuwait 2011 NA 79102 kwt asia asia_west
## 4 Luxembourg 2011 0.3 91469 lux europe europe_west
## 5 Monaco 2011 NA 58081 mco europe europe_west
## 6 Norway 2011 0.1 62737 nor europe europe_west
## 7 Qatar 2011 NA 133734 qat asia asia_west
## 8 Singapore 2011 0.1 74949 sgp asia east_asia_pa…
## 9 Switzerland 2011 0.4 54551 che europe europe_west
## 10 United Ara… 2011 NA 56192 are asia asia_west
p_rest <- p_rest +
geom_text_repel(data = x_90, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_95, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_00, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_05, aes(label = geo, col = four_regions) , size = 3) +
geom_text_repel(data = x_11, aes(label = geo, col = four_regions) , size = 3)
p_rest
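The per-year tables show which countries are above the threshold, but not directly which ones are getting richer. One way to see the change is to put the incomes of two years side by side. A minimal sketch using pivot_wider() from tidyr, on a hypothetical tibble `rich` that holds the 1990 and 2011 incomes of three countries copied from the tables above; the names `income_change` and `change` are made up.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; incomes copied from the 1990 and 2011 outputs above
rich <- tibble(
  Country = rep(c("Brunei", "Luxembourg", "Qatar"), each = 2),
  Year    = rep(c("1990", "2011"), times = 3),
  Income  = c(77076, 71991, 56922, 91469, 73402, 133734)
)

# One row per country, one income column per year, then the difference
income_change <- rich %>%
  pivot_wider(names_from = Year, values_from = Income, names_prefix = "y") %>%
  mutate(change = y2011 - y1990) %>%
  arrange(desc(change))
income_change
```

In this small sample, Qatar and Luxembourg have a positive change, while Brunei has a negative one, even though all three stay above the threshold.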
EXERCISE: Create a visualisation of the HIV prevalence data for the Americas that distinguishes between the sub-regions of the Americas.